TAKE AWAY POINTS FROM THIS POST
With ggmap you can use Google Maps within the ggplot2 environment.
Delaunay triangulation is a way to compute at the area and centroid of a strangely shaped polygon, and can be computed in R with the deldir package.
The package gganimate can be used to show time series data as an animation.
For some reason Barcelona only felt the need for one metro stop in 1912.
Today’s data is the location of stops in metro systems in European four cities: Paris, Berlin, Barcelona, and Prague. To collected the names of stops from each cities respective Wikipedia article. I also coded if the stop actually was in the city being analyzed or a different top, usually bordering the city.
With my data in place I began to work with it in R to organize it. I used three packages to start off, dplyr, tidyr, and ggmap. The packages dplyr and tidyr have been discussed previously in this blog [ADD LINK], but ggmap is new. With ggmap you can download maps from various sources, including Google Maps, and plot with it in the ggplot2 environment. I first read in my data and then create a new column called “geo_location” by combining the “station” and “location” columns with a “unite()” call. I also use a “separate()” call, the converse of “unite()” to split the “opened” column (which refers to the date when the stop was opened) into three columns, one for month, day and year. now I get to use my first ggmap call, “mutate_geocode()”. I can feed the call my “geo_location” column from my data frame and it will make two new columns, “lon” and “lat”, find the longitude and latitude of each stop, and add these values to my new columns. Note, I originally tried added the word “Station” at the end of the stop for all stops but this caused problems. Also, the output from Google Maps is not exactly the same as the Google Maps API. I tried to correct errors as much as possible, but I am not an expert on European Metro systems. If you see an erroneous data point from your city feel free to let me know! Below you can see a table of the data we’ve created.
library(dplyr)
library(tidyr)
library(ggmap)
# data = read.table("data_metros.txt", header=T, sep="\t") %>%
# unite(geo_location, c(station, location), sep = ", ", remove = FALSE) %>%
# separate(opened, into = c("opened_month", "opened_day", "opened_year"), sep = "/") %>%
# mutate_geocode(geo_location, source = "google")
data = read.table("data_metro_full.txt", header = T, sep="\t")
With our data in place we can start making our maps. This brings us to our second ggmaps call, “get_googlemap()”. With this call I can download city specific maps for my four cities by setting “center” to each of my cities. I can also set the type of map (terrain, satellite, roadmap, hybrid), how close to zoom in (integers that range from continent to building), the size of my map in pixels, and if I want the map in black and white or color.
paris_map = get_googlemap(center = "Paris", maptype = "roadmap",
zoom = 11, size = c(640, 420), color = "bw")
berlin_map = get_googlemap(center = "Berlin", maptype = "roadmap",
zoom = 10, size = c(640, 420), color = "bw")
barcelona_map = get_googlemap(center = "Barcelona", maptype = "roadmap",
zoom = 11, size = c(640, 420), color = "bw")
prague_map = get_googlemap(center = "Prague", maptype = "roadmap",
zoom = 11, size = c(640, 420), color = "bw")
With our map objects saved from Google can now plot our maps and our metro stops on top. Since I’ll be making roughly the same plot each time I wrote a function which you can see below. The main difference from a typical ggplot2 plot is instead of using “ggplot()” to start off the plot you use “ggmap()” and then feed it the map we had saved. The setting “extent =”device" " is used to suppress the x and y axes with their tick marks. From then on it takes the same ggplot2 calls as any other plot. For example, we can use “geom_point()” to plot our metro stops. See the maps with metro stops for the four cities below. I’ve included the code for the Paris map for example, but hidden the rest since it basically the same.
city_plot = function(city_name, city_map){
ggmap(city_map, extent = "device") +
geom_point(data = subset(data, city == city_name), aes(x = lon, y = lat),
color = "#0571b0", size = 4)
}
paris.plot = city_plot("Paris", paris_map)
paris.plot
unnamed-chunk-6
unnamed-chunk-7
unnamed-chunk-8
unnamed-chunk-9
With our maps and data points in place let’s compute the Delaunay Triangulation for each city. This will let us find the area of the outer hull of the area a given city’s metro covers, and allows us to computer a center point, or centroid, for the metro system. We do this with the deldir package. First though, I am going to use a new function from tidyr called “nest()” which allows me to collapse a bunch of data into a single cell. By nesting by “city” I get one row for each city and then the rest of the data for each column is a list of values in one cell. Additionally, I can collapse all of my other columns into a single column using “.key”, in this case this new column is called “location_info”. Think of it as a data frame tucked within a cell of a data frame. With my data nested I can make a new column called “deldir” that will have all of the information from my “deldir()” call. The “deldir()” call simply takes two lists of continuous data points. It then computes several things, including the area of the shape and the edges of all the segments connecting the points. How do we access this information though? We can pull this information out with a purrr call, “map()”. The “map()” call takes in some data and a function and applies the data to the function in an iterative fashion. For our purposes though we’re saying we want to take the data in the form of the column “deldir” and pull out the “del.area”. Thanks to our “mutate()” call we then save it to a new column. We can do the same thing with “delsgs” (the segments of the shape) and “summary” (more information about the individual triangles). See the full nested data frame below.
library(purrr)
library(deldir)
make_deldir = function(df){
deldir(df$lon, df$lat)
}
data_deldir = data %>%
nest(-city, .key = location_info) %>%
mutate(deldir = map(location_info, make_deldir)) %>%
mutate(del.area = map(deldir, "del.area")) %>%
mutate(delsgs = map(deldir, "delsgs")) %>%
mutate(summary = map(deldir, "summary"))
Based on these areas it looks like the Berlin metro covers the most area at 0.059279 while Barcelona covers the smallest area at 0.016332. Now that we have our nested data frame with all pertinent information we’re going to unnest the data necessary for our new plots. First we need the “delsgs” data, which we use to draw the lines connecting the metro stops. To do this we’ll make a new data frame, dropping all columns except for “city” and “delsgs”. Then we “unnest()” the data frame. This will expand the “delsgs” column that had nested values, giving us many more rows and many more columns. The x1, y1, x2, and y1 values will be used later in our plot to draw the edges of our triangles. See the unnested data frame below.
data_deldir_delsgs = data_deldir %>%
select(city, delsgs) %>%
unnest()
In addition to the edges of my shape, we also want the centroid. To do this we’ll first make a new data frame focusing on just the city and summary information. We then “unnest()” the data frame just as we did for the edges, however we don’t stop here. What we’re really interested in is the centroid, which we need to compute ourselves. To do this we’ll first “group_by()” city. Then we’re going to “summarise()” the data. To compute the x-value for the centroid, “cent_x”, we’re going to take the “x” column, which contains the x-coordinates of all of the points, and multiply each point by the “del.wts” column, which contains the weights of the areas of each triangle. By adding these numbers together we get the x-value of the centroid of the entire figure. We can do the same thing for the y-value. See the table below for the data summarised to give us the centroids for each city.
data_deldir_cent = data_deldir %>%
select(city, summary) %>%
unnest() %>%
group_by(city) %>%
summarise(cent_x = sum(x * del.wts),
cent_y = sum(y * del.wts)) %>%
ungroup()
Now we can update our figures with the triangulations and centroids. I’ve again made a function to build the four maps. As before we start with “ggmap()” and our city specific map object. Next we’ll use “geom_segment()” to draw our edges. To do this we’ll use “x1”, “y1”, “x2”, and “y2” from our “data_deldir_delsgs” data frame we made earlier. We then plot the actual metro stop points just as we did in our original map with “geom_point()”. Finally we end with one more “geom_point()” call, this time on our “data_deldir_cent” data frame to plot the centroid specific to each city. See the four updated maps below, again I’ve left the code visible for the Paris map to see how the function works.
del_plot = function(city_name, city_map){
ggmap(city_map, extent = "device") +
geom_segment(data = subset(data_deldir_delsgs, city == city_name), aes(x = x1, y = y1, xend = x2, yend = y2),
size = 1.5, color= "#92c5de") +
geom_point(data = subset(data, city == city_name), aes(x = lon, y = lat),
color = "#0571b0", size = 4) +
geom_point(data = subset(data_deldir_cent, city == city_name),
aes(x = cent_x, y = cent_y),
size = 8, color= "#ca0020")
}
paris_del.plot = del_plot("Paris", paris_map)
paris_del.plot
unnamed-chunk-17
unnamed-chunk-18
unnamed-chunk-19
unnamed-chunk-20
We now have a good sense of what each city’s current metro system looks like, but how did these systems come to be this way? Now we’ll look at how these systems have changed and grown over time. That’s why at the beginning we made a column for “opened_year”. At this point the code gets less elegant but we’ll go through it step by step. It’s all the same principles as when we made our figures earlier.
The main idea of the following code is that we’re going to create unique triangulations for each year within each city. As more metro stations get added each year the triangulation will change. Just as we had “data_deldir_delsgs” and “data_deldir_cent”, we’re going to start by creating two empty data frames “time_deldir_delsgs” and “data_deldir_sum” (remember that our centroid data frame was based on the summary data). With our empty data frames initialized we can make a for loop. We want to go through each year, but for each city separately, so our first for loop goes through each city, filtering our data to only the city in question. Next we have our second for loop where we go through each year starting with the minimum year in the data for that city and up to 2015, the maximum year for the full data set. For a given year we filter to include only metro stops that were opened that year or earlier. Note, we need at least three points to make a triangle, and you may think that a city wouldn’t ever have only one or two metro stops but you would be wrong (cough Barcelona cough) so we’re going to put a stop gap saying if the number of data points is less than three the loop should skip that year and move to the next one.
Okay, assuming there are at least three data points though we’re going to run the “deldir()” call and then save it in “year_deldir”. Then we create two new data frames. the first is “year_deldir_delsgs” which contains the “delsgs” information from “deldir”. We’re going to add two columns though “city” and “opened_year” so we know which city and year this data comes from. We then add this information to our existing “time_deldir_delsgs” data frame with an “rbind()” call. We then do the same thing to create “year_deldir_sum” only we pull out the “summary” information from “year_deldir” instead of the “delsgs” information. We also add our “city” and “opened_year” columns and then “rbind()” it with “time_deldir_sum”. The loop does this for every city from the minimum year in the data up to 2015. See below for the two data frames we created.
time_deldir_delsgs = data.frame()
time_deldir_sum = data.frame()
for(c in c("Paris", "Berlin", "Barcelona", "Prague")) {
data_city = filter(data, city == c)
for(year in min(data_city$opened_year):2015) {
data_year = filter(data_city, opened_year <= year)
# Add condition to skip if number of stops less than 3
if(dim(data_year)[1] < 3) next
year_deldir = deldir(data_year$lon, data_year$lat)
year_deldir_delsgs = year_deldir$delsgs %>%
mutate(city = c) %>%
mutate(opened_year = year)
time_deldir_delsgs = rbind(time_deldir_delsgs, year_deldir_delsgs)
year_deldir_sum = year_deldir$summary %>%
mutate(city = c) %>%
mutate(opened_year = year)
time_deldir_sum = rbind(time_deldir_sum, year_deldir_sum)
}
}
As you may recall though we’re not necessarily interested in all the summary information, we just want it to compute our centroid. So, we make a new data frame “time_deldir_cent”. The code is the same as our earlier code for computing centroids, the only difference is that we’ll also group by “opened_year”, not just “city”, since we want unique centroids for each year for each city. See data frame of the centroids below.
time_deldir_cent = time_deldir_sum %>%
group_by(city, opened_year) %>%
summarise(cent_x = sum(x * del.wts),
cent_y = sum(y * del.wts)) %>%
ungroup()
There’s still one more thing I want to do before we make our figures. Right now the figures will have different start dates depending on when the first metro stop was built in a given city. Instead, I want all figures to start at the same year so we see the change over time with the same start date for each city. To this we’ll make a new data frame called “years” that simply lists the years 1900 to 2015 four times, once for each city. We then do a “left_join()” with our data. As a result any time the “opened_year” in question is not found in the data frame for a given city an empty row will be added, empty except for the “opened_year” and “city” values.
years = data.frame(opened_year = rep(seq(1900, 2015), 4),
city = c(rep("Paris", 116), rep("Berlin", 116),
rep("Barcelona", 116), rep("Prague", 116)))
data_time = left_join(years, data) %>%
mutate(opened_by_year = ifelse(opened_year %% 10 == 0, opened_year,
opened_year + (10 - (opened_year %% 10)))) %>%
filter(opened_by_year <= 2010)
time_deldir_delsgs_sub = time_deldir_delsgs %>%
filter(opened_year %% 10 == 0)
time_deldir_cent_sub = time_deldir_cent %>%
filter(opened_year %% 10 == 0)
I kept saying we were going to make maps showing the change over time, but how are we going to do that? Well instead of building a single static plot for each city we’re going to build an animation where as the year changes so will the map. To do this we’ll use the package gganimate which works on top of ggplot2 (which is useful since we’re already using ggmap which works on top of ggplot2). We build our plot just as we would any other ggplot2 figure, but for data we want to change we add the “frame” setting. The “frame” is the thing in the plot that changes, in our case “opened_year”. Also, while we only to plot want the triangulations and centroids specific to a given year, we want the points from for the metro stops to be additive. For example, when “frame” is 2000 we still want the points from 1990 to be plotted. To do this we add “cumulative = TRUE” to the call for those points. Also, since we updated our data to include empty rows that all plots start on 1900 all plots will have a frame starting at 1900, even if there are no data points to plot. I’ve again made function to make our plots. See below for the code for the Paris map as well as all four animations. Also, notice that in 1912 Barcelona gets their first metro stop…but doesn’t get anymore until 1924. Take a look to see if you can find any other interesting things about how the systems changed over time.
library(gganimate)
time_plot = function(city_name, city_map){
time.plot = ggmap(city_map, extent = "device") +
geom_segment(data = subset(time_deldir_delsgs_sub, city == city_name),
aes(x = x1, y = y1, xend = x2, yend = y2, frame = opened_year),
size = 1.5, color= "#92c5de") +
geom_point(data = subset(data_time, city == city_name),
aes(x = lon, y = lat, frame = opened_by_year, cumulative = TRUE),
color = "#0571b0", size = 4) +
geom_point(data = subset(time_deldir_cent_sub, city == city_name),
aes(x = cent_x, y = cent_y, frame = opened_year),
size = 8, color= "#ca0020")
}
paris_time.plot = time_plot("Paris", paris_map)
gg_animate(paris_time.plot)
unnamed-chunk-27